Speech recognition device and method

ABSTRACT

In a speech recognition device ( 1 ) for recognizing text information (TI) corresponding to speech information (SI), wherein speech information (SI) can be characterized in respect of language properties, there are firstly provided at least two language-property recognition means ( 20, 21, 22, 23 ), each of the language-property recognition means ( 20, 21, 22, 23 ) being arranged, by using the speech information (SI), to recognize a language property assigned to said means and to generate property information (ASI, LI, SGI, CI) representing the language property that is recognized, and secondly there are provided speech recognition means ( 24 ) that, while continuously taking into account the at least two items of property information (ASI, LI, SGI, CI), are arranged to recognize the text information (TI) corresponding to the speech information (SI).

The invention relates to a speech recognition device for recognizing text information corresponding to speech information.

The invention further relates to a speech recognition method for recognizing text information corresponding to speech information.

The invention further relates to a computer program product that is arranged to recognize text information corresponding to speech information.

The invention further relates to a computer that runs the computer program product detailed in the previous paragraph.

A speech recognition device of the kind specified in the first paragraph above, a speech recognition method of the kind specified in the second paragraph above, a computer program product of the kind specified in the third paragraph above and a computer of the kind specified in the fourth paragraph above are known from patent WO 98/08215.

In the known speech recognition device, speech recognition means are provided to which speech information is fed via a microphone. The speech recognition means are arranged to recognize the text information in the speech information while continuously taking into account property information that represents the context to be used at the time for recognizing the text information. For the purpose of generating the property information, the known speech recognition device has language-property recognition means that are arranged to receive a representation of the speech information from the speech recognition means and, by using this representation of the speech information, to recognize the context that exists at the time as a language property that characterizes the speech information and to generate the property information that represents the current context.

In the known speech recognition device, there is the problem that although provision is made for the recognition of a single language property that characterizes the speech information, namely for the recognition of the context that exists at the time, other language properties that characterize the speech information, such as speech segmentation, or the language being used at the time, or the speaker group that applies at the time, are not taken into account during the recognition of the text information. These language properties that are left out of account therefore need to be known beforehand, i.e. before use is made of the known speech recognition device, and, in the event that allowance can in fact be made for them, have to be preconfigured, which may mean they have to be preset to fixed values, i.e. to be unalterable. This makes it impossible for the known speech recognition device to be used in an application where these language properties, which cannot be taken into account, change during operation, i.e. while the text information is being recognized.

It is an object of the invention to overcome the problem detailed above in a speech recognition device of the kind specified in the first paragraph above, in a speech recognition method of the kind specified in the second paragraph above, in a computer program product of the kind specified in the third paragraph above and in a computer of the kind specified in the fourth paragraph above, and to provide an improved speech recognition device, an improved speech recognition method, an improved computer program product and an improved computer.

To achieve the object stated above, features according to the invention are provided in a speech recognition device according to the invention, thus enabling a speech recognition device according to the invention to be characterized in the manner stated below, namely:

A speech recognition device for recognizing text information corresponding to speech information, which speech information can be characterized in respect of language properties, wherein first language-property recognition means are provided that, by using the speech information, are arranged to recognize a first language property and to generate first property information representing the first language property that is recognized, wherein at least second language-property recognition means are provided that, by using the speech information, are arranged to recognize a second language property of the speech information and to generate second property information representing the second language property that is recognized, and wherein speech recognition means are provided that are arranged to recognize the text information corresponding to the speech information while continuously taking into account at least the first property information and the second property information.

To achieve the object stated above, features according to the invention are provided in a speech recognition method according to the invention, thus enabling a speech recognition method according to the invention to be characterized in the manner stated below, namely:

A speech recognition method for recognizing text information corresponding to speech information, which speech information can be characterized in respect of language properties, wherein, by using the speech information, a first language property is recognized, wherein first property information representing the first language property that is recognized is generated, wherein at least one second language property is recognized by using the speech information, wherein second property information representing the second language property that is recognized is generated, and wherein the text information corresponding to the speech information is recognized while continuously taking into account at least the first property information and the second property information.

To achieve the object stated above, provision is made in a computer program product according to the invention for the computer program product to be able to be loaded directly into a memory of a computer and to comprise sections of software code, it being possible for the speech recognition method according to the invention to be performed by the computer when the computer program product is run on the computer.

To achieve the object stated above, provision is made in a computer according to the invention for the computer to have a processing unit and an internal memory and to run the computer program product specified in the previous paragraph.

By the making of the provisions according to the invention, the advantage is obtained that reliable recognition of text information in speech information is ensured even when there are a plurality of language properties that alter during the recognition of the text information. This gives the further advantage that the accuracy of recognition is considerably improved, because mis-recognition of the text information due to failure to take into account an alteration in a language property can be reliably avoided by the generation and taking into account of the at least two items of property information: any alteration in either of the language properties is immediately represented by an item of property information associated with this language property and can therefore be taken into account while the text information is being recognized. The further advantage is thereby obtained that, by virtue of the plurality of items of property information available, considerably more exact modeling of the language can be utilized to allow the text information to be recognized, which makes a positive contribution to the accuracy with which the language properties are recognized, consequently to the recognition of the text information too and, what is more, to the speed with which the text information is recognized as well. A further advantage is obtained in this way, namely that it becomes possible for the speech recognition device according to the invention to be used in an area of application that makes the most stringent demands on the flexibility with which the text information is recognized, such as for example in a conference transcription system for automatically transcribing speech information occurring during a conference. In this area of application, it is even possible to obtain recognition of the text information approximately in real time, even where the speech information that exists is produced by different speakers in different languages.

In the solutions according to the invention, it has also proved advantageous if, in addition, the features detailed in claim 2 and claim 7, respectively, are provided. This gives the advantage that the bandwidth of an audio signal that is used for the reception of the speech information, where the bandwidth of the audio signal is dependent on the particular reception channel, can be taken into account in the recognition of the property information and/or in the recognition of the text information.

In the solutions according to the invention, it has also proved advantageous if, in addition, the features detailed in claim 3 and claim 8, respectively, are provided. This gives the advantage that a part of the speech information is only processed by the speech recognition means if valid property information exists for said part of the speech information, i.e. if the language properties have been determined for said part, thus enabling any unnecessary wastage or taking up of computing capacity, i.e. of so-called system resources, required for the recognition of text information to be reliably avoided.

In the solutions according to the invention, it has also proved advantageous if, in addition, the features detailed in claim 4 and claim 9, respectively, are provided. This gives the advantage that it becomes possible for the at least two language-property recognition means to influence one another. This gives the further advantage that it becomes possible for the individual language properties to be recognized sequentially in a sequence that is helpful for the recognition of the language properties, which makes a positive contribution to the speed and accuracy with which the text information is recognized and allows improved use to be made of the computing capacity.

In the solutions according to the invention, it has also proved advantageous if, in addition, the features detailed in claim 5 and claim 10, respectively, are provided. This gives the advantage that it becomes possible for the given language property to be recognized as a function of the other language property in as reliable a way as possible, because the other language property that can be used to recognize the given language property is only used if the property information that corresponds to the other language property, i.e. the language property that needs to be taken into account, is in fact available.

In a computer program product according to the invention, it has also proved advantageous if, in addition, the features detailed in claim 11 are provided. This gives the advantage that the computer program product can be marketed, sold or hired as easily as possible.

These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter, to which however it is not limited.

In the drawings:

FIG. 1 is a schematic view in the form of a block circuit diagram of a speech recognition device according to one embodiment of the invention,

FIG. 2 shows, in a similar way to FIG. 1, audio preprocessor means of the speech recognition device shown in FIG. 1,

FIG. 3 shows, in a similar way to FIG. 1, feature-vector extraction means of the speech recognition device shown in FIG. 1,

FIG. 4 shows, in a similar way to FIG. 1, reception-channel recognition means of the speech recognition device shown in FIG. 1,

FIG. 5 shows, in a similar way to FIG. 1, first language-property recognition means of the speech recognition device shown in FIG. 1,

FIG. 6 shows, in a similar way to FIG. 1, second language-property recognition means of the speech recognition device shown in FIG. 1,

FIG. 7 shows, in a similar way to FIG. 1, third language-property recognition means of the speech recognition device shown in FIG. 1,

FIG. 8 shows, in a similar way to FIG. 1, fourth language-property recognition means of the speech recognition device shown in FIG. 1,

FIG. 9 shows, in a similar way to FIG. 1, speech recognition means of the speech recognition device shown in FIG. 1,

FIG. 10 shows, in a similar schematic way in the form of a bar-chart, a plot over time of the activities of a plurality of recognition means of the speech recognition device shown in FIG. 1,

FIG. 11 shows, in a similar way to FIG. 1, a detail of the audio preprocessor means shown in FIG. 1,

FIG. 12 shows, in a similar way to FIG. 1, a logarithmic filter bank stage of the feature-vector extraction means shown in FIG. 3,

FIG. 13 shows, in a similar way to FIG. 1, a music recognition stage of the first language-property recognition means shown in FIG. 5,

FIG. 14 shows, in a similar way to FIG. 1, a second training stage of the second language-property recognition means shown in FIG. 6,

FIG. 15 shows, in a similar way to FIG. 1, a fourth training stage of the third language-property recognition means shown in FIG. 7,

FIG. 16 shows, in a similar way to FIG. 1, a sixth training stage of the fourth language-property recognition means shown in FIG. 8.

Shown in FIG. 1 is a speech recognition device 1 that is arranged to recognize text information TI corresponding to speech information SI, and that forms a conference transcription device by means of which the speech information SI that occurs at a conference and is produced by conference participants when they speak can be transcribed into text information TI.

The speech recognition device 1 is implemented in the form of a computer 1A, of which only the functional assemblies relevant to the speech recognition device 1 are shown in FIG. 1. The computer 1A has a processing unit that is not shown in FIG. 1 and an internal memory 1B, although only the functions of the internal memory 1B that are relevant to the speech recognition device 1 will be considered in detail below in connection with FIG. 1. The speech recognition device 1 uses the internal memory 1B to recognize the text information TI corresponding to the speech information SI. The computer runs a computer program product that can be loaded directly into the memory 1B of the computer 1A and that has sections of software code.

The speech recognition device 1 has reception means 2 that are arranged to receive speech information SI and to generate and emit audio signals AS representing the speech information SI, the bandwidth of the audio signal AS, which affects the recognition of the speech information SI, being dependent on the reception channel or transmission channel that is used to receive the speech information SI. The reception means 2 have a first reception stage 3 that forms a first reception channel and by means of which the speech information SI can be received via a plurality of microphones 4, each microphone 4 being assigned to one of the conference participants present in a conference room, by whom the speech information SI can be generated. Associated with the microphones 4 is a so-called sound card (not shown in FIG. 1) belonging to the computer 1A, by means of which the analog audio signals AS can be converted into digital audio signals AS. The reception means 2 also have a second reception stage 5 that forms a second reception channel and by means of which the speech information SI can be received via a plurality of analog telephone lines. The reception means 2 also have a third reception stage 6 that forms a third reception channel and by means of which the speech information SI can be received via a plurality of ISDN telephone lines. The reception means 2 also have a fourth reception stage 7 that forms a fourth reception channel and by means of which the speech information SI can be received via a computer data network by means of a so-called “voice-over-IP” data stream. The reception means 2 are also arranged to emit a digital representation of the audio signal AS received, in the form of a data stream, the digital representation of the audio signal AS having audio-signal formatting corresponding to the given reception channel and the data stream having so-called audio blocks and so-called audio headers contained in the audio blocks, which audio headers specify the particular audio-signal formatting.

The speech recognition device 1 also has audio preprocessor means 8 that are arranged to receive the audio signal AS emitted by the reception means 2. The audio preprocessor means 8 are further arranged to convert the audio signal AS received into an audio signal PAS that is formatted in a standard format, namely a standard PCM format, and that is intended for further processing, and to emit the audio signal PAS. For this purpose, the audio preprocessor means 8 shown in FIG. 2 have a code recognition stage 9, a first data-stream control stage 10, a decoding stage 11, a decoding algorithm selecting stage 12, a decoding algorithm storage stage 13, and a high-pass filter stage 14. The audio signal AS received can be fed directly to the first data-stream control stage 10. The audio headers can be fed to the code recognition stage 9. By reference to the audio headers, the code recognition stage 9 is arranged to recognize a possible coding of the audio signal AS represented by the audio blocks and, when a coding is present, to transmit code recognition information COI to the decoding algorithm selecting stage 12. When a coding is present, the code recognition stage 9 is also arranged to transmit data-stream influencing information DCSI to the first data-stream control stage 10, to allow the audio signal AS fed to the first data-stream control stage 10 to be transmitted to the decoding stage 11. If the audio signal AS is not found to have a coding, the code recognition stage 9 can control the data-stream control stage 10, by means of the data-stream influencing information DCSI, in such a way that the audio signal AS can be transmitted directly from the data-stream control stage 10 to the high-pass filter stage 14.

The decoding algorithm storage stage 13 is arranged to store a plurality of decoding algorithms. The decoding algorithm selecting stage 12 is implemented in the form of a software object that, as a function of the code recognition information COI, is arranged to select one of the stored decoding algorithms and, by using the decoding algorithm selected, to implement the decoding stage 11. The decoding stage 11 is arranged to decode the audio signal AS as a function of the decoding algorithm selected and to transmit a code-free audio signal AS to the high-pass filter stage 14. The high-pass filter stage 14 is arranged to apply high-pass filtering to the audio signal AS, thus enabling interfering low-frequency components of the audio signal AS to be removed, which low-frequency components may have a disadvantageous effect on further processing of the audio signal AS.
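By way of a minimal sketch in Python, and assuming hypothetical codec identifiers, a stand-in decoder and an assumed filter coefficient, the cooperation of the stages 9 to 14 might look as follows:

```python
import numpy as np

# Hypothetical decoder registry: maps a codec identifier, recognized from
# the audio headers (code recognition information COI), to a decoding
# algorithm (cf. decoding algorithm storage stage 13).
def alaw_decode(blocks):
    # Stand-in for a real A-law decoder.
    return blocks

DECODERS = {"pcm": lambda blocks: blocks, "alaw": alaw_decode}

def highpass(x, alpha=0.98):
    # One-pole high-pass filter (cf. high-pass filter stage 14): removes
    # interfering low-frequency components of the audio signal AS.
    y = np.empty_like(x)
    prev_x = prev_y = 0.0
    for n, xn in enumerate(x):
        prev_y = alpha * (prev_y + xn - prev_x)
        prev_x = xn
        y[n] = prev_y
    return y

def preprocess(audio_blocks, coi):
    # Select a decoding algorithm as a function of COI (cf. selecting
    # stage 12), decode when a coding is present (cf. decoding stage 11),
    # otherwise pass the signal straight through (cf. data-stream control
    # stage 10), then high-pass filter the result.
    decode = DECODERS.get(coi, DECODERS["pcm"])
    return highpass(np.asarray(decode(audio_blocks), dtype=np.float64))
```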

The audio preprocessor means 8 also have a stage 15 for generating PCM format conversion parameters that is arranged to receive the high-pass filtered audio signal AS and to process PCM format information PCMF belonging to the high-pass filtered audio signal AS, the PCM format information PCMF being represented by the particular audio header. The stage 15 for generating PCM format conversion parameters is also arranged to generate and emit PCM format conversion parameters PCP, by using the PCM format information PCMF and definable PCM format configuring information PCMC (not shown in FIG. 2) that specifies the standard PCM format to be produced for the audio signal AS.

The audio preprocessor means 8 also have a conversion-stage implementing stage 16 that is in the form of a software object and that is arranged to receive and process the PCM format conversion parameters PCP and, by using these parameters PCP, to implement a PCM format conversion stage 17. The PCM format conversion stage 17 is arranged to receive the high-pass filtered audio signal AS, to convert it into the audio signal PAS and to emit the audio signal PAS from the audio preprocessor means 8. The PCM format conversion stage 17 has a plurality of conversion stages (not shown in FIG. 2), which can be put into action as a function of the PCM format conversion parameters PCP, to implement the PCM format conversion stage 17.

The stage 15 for generating PCM format conversion parameters that is shown in detail in FIG. 11 has at the input end a parser stage 15A that, by using the PCM format configuring information PCMC and the PCM format information PCMF, is arranged to set the number of conversion stages of the PCM format conversion stage 17 and the input/output PCM formats individually assigned to them, which is represented by object specifying information OSI that can be emitted by it. The PCM format information PCMF defines in this case an input audio signal format to the stage 15 for generating PCM format conversion parameters, and the PCM format configuring information PCMC defines an output audio signal format from said stage 15. The stage 15 for generating PCM format conversion parameters also has a filter planner stage 15B that, by using the object specifying information OSI, is arranged to plan further properties for each of the conversion stages, which further properties and the object specifying information OSI are represented by the PCM format conversion parameters PCP that can be generated and emitted by said stage 15.
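Purely as a hedged illustration of the parser stage 15A and the filter planner stage 15B, the following sketch derives a chain of conversion steps from an assumed input format PCMF and an assumed target format PCMC; the field names and step names are invented for the example:

```python
def plan_conversion(pcmf, pcmc):
    # Parser stage 15A, schematically: determine which conversion stages
    # are needed from the input format (PCMF) and the output format (PCMC).
    steps = []
    if pcmf["rate"] != pcmc["rate"]:
        steps.append(("resample", pcmf["rate"], pcmc["rate"]))
    if pcmf["channels"] != pcmc["channels"]:
        steps.append(("remix", pcmf["channels"], pcmc["channels"]))
    if pcmf["bits"] != pcmc["bits"]:
        steps.append(("requantize", pcmf["bits"], pcmc["bits"]))
    # The filter planner stage 15B would attach further per-step
    # properties; together these correspond to the parameters PCP.
    return steps

# Example: 8 kHz / 8-bit mono telephone audio to 16 kHz / 16-bit mono PCM.
pcp = plan_conversion({"rate": 8000, "channels": 1, "bits": 8},
                      {"rate": 16000, "channels": 1, "bits": 16})
```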

The speech recognition device 1 shown in FIG. 1 also has reception-channel recognition means 18 that are arranged to receive the audio signal PAS preprocessed by the audio preprocessor means 8, to recognize the reception channel being used at the time to receive the speech information SI, to generate channel information CHI representing the reception channel that is recognized and to emit this channel information CHI.

The speech recognition device 1 also has feature-vector extraction means 19 that are arranged to receive the audio signal PAS preprocessed by the audio preprocessor means 8 in the same way as the reception-channel recognition means 18, and also the channel information CHI and, while taking into account the channel information CHI, to generate and emit what are termed feature vectors FV, which will be considered in detail at a suitable point in connection with FIG. 3.

The speech recognition device 1 also has first language-property recognition means 20 that are arranged to receive the feature vectors FV representing the speech information SI and to receive the channel information CHI. The first language-property recognition means 20 are further arranged, by using the feature vectors FV and by continuously taking into account the channel information CHI, to recognize a first language property, namely an acoustic segmentation, and to generate and emit first property information that represents the acoustic segmentation recognized, namely segmentation information ASI.

The speech recognition device 1 also has second language-property recognition means 21 that are arranged to receive the feature vectors FV representing the speech information SI, to receive the channel information CHI, and to receive the segmentation information ASI. The second language-property recognition means 21 are further arranged, by using the feature vectors FV and by continuously taking into account the channel information CHI and the segmentation information ASI, to recognize a second language property, namely what the language involved is, i.e. English, French or Spanish for example, and to generate and emit second property information that represents the language recognized, namely language information LI.

The speech recognition device 1 also has third language-property recognition means 22 that are arranged to receive the feature vectors FV representing the speech information SI, the channel information CHI, the segmentation information ASI and the language information LI. The third language-property recognition means 22 are further arranged, by using the feature vectors FV and by continuously taking into account the items of information CHI, ASI and LI, to recognize a third language property, namely a speaker group, and to generate and emit third property information that represents the speaker group recognized, namely speaker group information SGI.

The speech recognition device 1 also has fourth language-property recognition means 23 that are arranged to receive the feature vectors FV representing the speech information SI, and to receive the channel information CHI, the segmentation information ASI, the language information LI and the speaker group information SGI. The fourth language-property recognition means 23 are further arranged, by using the feature vectors FV and by continuously taking into account the items of information CHI, ASI, LI and SGI, to recognize a fourth language property, namely a context, and to generate and emit fourth property information that represents the context recognized, namely context information CI.

The speech recognition device 1 also has speech recognition means 24 that, while continuously taking into account the channel information CHI, the first item of property information ASI, the second item of property information LI, the third item of property information SGI and the fourth item of property information CI, are arranged to recognize the text information TI by using the feature vectors FV representing the speech information SI and to emit the text information TI.

The speech recognition device 1 also has text-information storage means 25, text-information editing means 26 and text-information emitting means 27, the means 25 and 27 being arranged to receive the text information TI from the speech recognition means 24. The text-information storage means 25 are arranged to store the text information TI and to make the text information TI available for further processing by the means 26 and 27.

The text-information editing means 26 are arranged to access the text information TI stored in the text-information storage means 25 and to enable the text information TI that can be automatically generated by the speech recognition means 24 from the speech information SI to be edited. For this purpose, the text-information editing means 26 have display/input means (not shown in FIG. 1) that allow a user, such as a proof-reader for example, to edit the text information TI so that unclear points or errors that occur in the text information TI in the course of the automatic transcription, caused by a conference participant's unclear or incorrect enunciation or by problems in the transmission of the audio signal AS, can be corrected manually.

The text-information emitting means 27 are arranged to emit the text information TI that is stored in the text-information storage means 25 and, if required, has been edited by a user, the text-information emitting means 27 having interface means (not shown in FIG. 1) to transmit the text information TI in the form of a digital data stream to a computer network and to a display device.

In what follows, it will be explained how the recognition means 18, 20, 21, 22, 23 and 24 cooperate over time by reference to a plot of the activities of the recognition means 18, 20, 21, 22, 23 and 24 that is shown in FIG. 10. For this purpose, the individual activities are shown in FIG. 10 in the form of a bar-chart, where a first activity bar 28 represents the activity of the reception-channel recognition means 18, a second activity bar 29 represents the activity of the first language-property recognition means 20, a third activity bar 30 represents the activity of the second language-property recognition means 21, a fourth activity bar 31 represents the activity of the third language-property recognition means 22, a fifth activity bar 32 represents the activity of the fourth language-property recognition means 23, and a sixth activity bar 33 represents the activity of the speech recognition means 24.

The first activity bar 28 extends from a first begin point in time T1B to a first end point in time T1E. The second activity bar 29 extends from a second begin point in time T2B to a second end point in time T2E. The third activity bar 30 extends from a third begin point in time T3B to a third end point in time T3E. The fourth activity bar 31 extends from a fourth begin point in time T4B to a fourth end point in time T4E. The fifth activity bar 32 extends from a fifth begin point in time T5B to a fifth end point in time T5E. The sixth activity bar 33 extends from a sixth begin point in time T6B to a sixth end point in time T6E. During the activity of a given recognition means 18, 20, 21, 22, 23 or 24, the given recognition means completely processes the whole of the speech information SI, with each of the recognition means 18, 20, 21, 22, 23 or 24 beginning the processing of the speech information SI at the start of the speech information and at the particular begin point in time T1B, T2B, T3B, T4B, T5B or T6B assigned to it and completing the processing at the particular end point in time T1E, T2E, T3E, T4E, T5E or T6E assigned to it. There is usually virtually no difference between the overall processing time-spans that exist between the begin points in time T1B, T2B, T3B, T4B, T5B and T6B and the end points in time T1E, T2E, T3E, T4E, T5E and T6E. Differences may, however, occur in the individual overall processing time-spans if the respective processing speeds of the means 18, 20, 21, 22, 23 and 24 differ from one another, which for example has an effect if the speech information SI is made available off-line. What is meant by off-line in this case is for example that the speech information SI was previously recorded on a recording medium and this medium is subsequently made accessible to the speech recognition device 1.

Also shown in the chart are start delays d1 to d6 corresponding to the respective recognition means 18, 20, 21, 22, 23 and 24, with d1=0 in the present case because the zero point on the time axis T has been selected to coincide in time with the first begin point in time T1B for the reception-channel recognition means 18. It should, however, be mentioned that the zero point in question can also be selected to be situated at some other point in time, thus making d1 unequal to zero.

Also entered in the chart are respective initial processing delays D1 to D6 corresponding to the recognition means 18, 20, 21, 22, 23 and 24, which delays D1 to D6 are caused by the particular recognition means 18, 20, 21, 22, 23 and 24 when they generate their respective items of information CHI, ASI, LI, SGI, CI and TI for the first time. Mathematically, the relationship between $d_i$ and $D_i$ can be summed up as follows, where, by definition, $d_0 = 0$ and $D_0 = 0$:

$$d_i = d_{i-1} + D_{i-1}, \qquad i = 1 \ldots 6,$$

and, following from this:

$$d_i = \sum_{j=0}^{i-1} D_j, \qquad i = 1 \ldots 6.$$
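A minimal numeric sketch of this recurrence follows; only D1 (approximately 100 milliseconds, as stated further below) is given in the text, so the remaining delay values are assumed purely for illustration:

```python
# Assumed initial processing delays D0..D6 in milliseconds; D0 = 0 by
# definition, D1 is approximately 100 ms per the text, D2..D6 are invented.
D = [0, 100, 80, 80, 80, 120, 150]

d = [0] * 7                        # d0 = 0 by definition
for i in range(1, 7):
    d[i] = d[i - 1] + D[i - 1]     # d_i = d_(i-1) + D_(i-1)

# Each d[i] equals the sum of D[0]..D[i-1]; with the values above,
# d = [0, 0, 100, 180, 260, 340, 460], i.e. d1 = 0 and d2 = D1 = 100 ms.
```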

At the first begin point in time T1B, the reception-channel recognition means 18 begin recognizing the reception channel 3, 5, 6 or 7 that is being used at the time to receive the speech information SI. The recognition of the given reception channel 3, 5, 6 or 7 takes place in this case, during a first initial processing delay D1, for a sub-area of a first part of the speech information SI, which first part can be transmitted during the processing delay D1 by the audio preprocessor means 8 to the reception-channel recognition means 18 in preprocessed form and which first part can be used during the processing delay D1 by the reception-channel recognition means 18 to allow the reception channel 3, 5, 6 or 7 being used to be recognized for the first time. In the present case the processing delay D1 is approximately one hundred (100) milliseconds and the first part of the speech information SI comprises approximately ten (10) so-called frames, with each frame representing the speech information SI for a period of approximately 10 milliseconds at the audio signal level. At the end of the processing delay D1, the reception-channel recognition means 18 generate for the first time the channel information CHI representing the reception channel 3, 5, 6 or 7 that has been recognized, for a first frame of the first part of the speech information SI, and transmit this channel information CHI to the four language-property recognition means 20 to 23 and to the speech recognition means 24. This is indicated in the chart by the cluster of arrows 34.

As time continues to the first end point in time T1E, the reception-channel recognition means 18 continuously generate channel information CHI, updated frame by frame, and make it available to the four language-property recognition means 20 to 23 and the speech recognition means 24, thus enabling the channel information CHI to be continuously taken into account by the recognition means 20 to 24 frame by frame. In the course of this, and beginning with the second frame of the speech information SI, one further part of the speech information SI is processed at a time, which part contains a number of frames matched to the circumstances, and channel information CHI that applies to each first frame, i.e. to the first sub-area of the given part of the speech information SI, is generated or made available. Adjoining parts of the speech information SI, such as the first part and a second part, differ from one another in this case in that the second part has as a last frame a frame that is adjacent to the first part but is not contained in the first part, and in that the first frame of the second part is formed by a second frame of the first part that follows on from the first frame of the first part.

It should be mentioned at this point that, after the channel information CHI is generated for the first time, time-spans different from the first initial processing delay D1 may occur in the further, i.e. continuing, generation of the channel information CHI, as a function of the occurrence of the audio signal AS on one of the reception channels 3, 5, 6 and 7, and it may thus be possible for a different number of frames to be covered when generating the channel information CHI for the first frame of the given number of frames, i.e. for the first frames of the further parts of the speech information SI. It should also be mentioned at this point that adjoining parts of the speech information SI may also differ by more than two frames. Another point that should be mentioned is that the sub-area of a part of the speech information SI for which the channel information CHI is generated may also comprise various frames, in which case these various frames are preferably located at the beginning of a part of the speech information SI. Yet another point that should be mentioned is that this particular sub-area of a part of the speech information SI for which the channel information CHI is generated may also comprise the total number of frames contained in the part of the speech information SI, thus making the particular sub-area identical to the part. A final point that should be mentioned is that the particular sub-area of a part of the speech information SI for which the channel information CHI is generated need not necessarily be the first frame but could equally well be the second frame or any other frame of the part of the speech information SI. It is important for it to be understood in this case that a frame has precisely one single item of channel information CHI assigned to it.
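The default case described above, in which adjoining parts shift forward by one frame and the channel information CHI is generated for the first frame of each part, can be sketched as follows; the part length of ten frames matches the example above, while the stand-in recognizer is an assumption:

```python
def recognize_channel(part):
    # Stand-in for the reception-channel recognition means 18.
    return "channel-3"

def sliding_parts(frames, part_len=10):
    # Each further part shifts forward by one frame relative to its
    # predecessor (the default case; larger shifts are also possible,
    # as noted in the preceding paragraph).
    for start in range(len(frames) - part_len + 1):
        yield start, frames[start:start + part_len]

frames = [f"frame-{n}" for n in range(15)]   # stand-in 10 ms frames
channel_info = {}
for first_frame, part in sliding_parts(frames):
    # CHI is generated for the first frame (first sub-area) of each part,
    # so every frame receives precisely one item of channel information.
    channel_info[first_frame] = recognize_channel(part)
```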

In anticipation, it should be specified at this point that the statements made above regarding a part of the speech information SI and regarding that sub-area of the given part of the speech information SI for which the respective items of information ASI, LI, SGI, CI and TI are generated also apply to the means 20, 21, 22, 23 and 24.

Starting at point in time T2B, the first language-property recognition means 20 begin the recognition for the first time of the acoustic segmentation for the first frame, i.e. for the first sub-area of the first part of the speech information SI, doing so with a delay equal to the starting delay d2 and by using the feature vectors FV representing the first part of the speech information SI and while taking into account the channel information CHI that has been assigned in each case to each frame in the first part of the speech information SI. The starting delay d2 corresponds in this case to the initial processing delay D1 caused by the reception-channel recognition means 18. Hence the first language-property recognition means 20 are arranged to recognize the acoustic segmentation for the first frame for the first time with a delay of at least the time-span that is required by the reception-channel recognition means 18 to generate the channel information CHI for the first frame. The first language-property recognition means 20 also have a second initial processing delay D2 of their own, in which case the segmentation information ASI for the first frame of the first part of the speech information SI can be generated for the first time after this processing delay D2 has elapsed and can be transmitted to the recognition means 21 to 24, which is indicated by a single arrow 35 that takes the place of a further cluster of arrows that is not shown in FIG. 10.

Following the processing delay D2, updated segmentation information ASI is continuously generated or made available by the first language-property recognition means 20 for the further frames of the speech information SI that occur after its first frame, namely for each first frame of a respective part of the speech information SI, which they do while continuously taking into account the channel information CHI corresponding to each frame of the given part of the speech information SI.

Starting at point in time T3B, the second language-property recognition means 21 begin the recognition for the first time of the language for the first frame, i.e. for the first sub-area of the first part of the speech information SI, doing so with a delay equal to the starting delay d3 and by using the feature vectors FV representing the first part of the speech information SI and while taking into account the channel information CHI that has been assigned in each case to each frame in the first part of the speech information SI. The starting delay d3 corresponds in this case to the sum of the initial processing delays D1 and D2 caused by the reception-channel recognition means 18 and the first language-property recognition means 20. Hence the second language-property recognition means 21 are arranged to recognize the language for the first frame for the first time with a delay of at least the time-span that is required by the reception-channel recognition means 18 and the language-property recognition means 20 to generate the channel information CHI and the segmentation information ASI for the first frame for the first time. The second language-property recognition means 21 also have a third initial processing delay D3 of their own, in which case the language information LI for the first frame of the speech information SI can be generated for the first time after this processing delay D3 has elapsed and can be transmitted to the recognition means 22 to 24, which is indicated by a single arrow 36 that takes the place of a further cluster of arrows that is not shown in FIG. 10.

Following the processing delay D3, updated language information LI is continuously generated or made available by the second language-property recognition means 21 for the further frames of the speech information SI that occur after its first frame, namely for each first frame of the respective part of the speech information SI, which they do while continuously taking into account the items of information CHI and ASI corresponding to each frame of the given part of the speech information SI.

Starting at point in time T4B, the third language-property recognition means 22 begin the recognition for the first time of the speaker group for the first frame, i.e. for the first sub-area of the first part of the speech information SI, doing so with a delay equal to the starting delay d4 and by using the feature vectors FV representing the first part of the speech information SI and while taking into account the channel information CHI, segmentation information ASI and language information LI that have been assigned in each case to each frame in the first part of the speech information SI. The starting delay d4 corresponds in this case to the sum of the initial processing delays D1, D2 and D3 caused by the reception-channel recognition means 18, the first language-property recognition means 20 and the second language-property recognition means 21. Hence the third language-property recognition means 22 are arranged to recognize the speaker group for the first frame for the first time with a delay of at least the time-span that is required by the means 18, 20 and 21 to generate the channel information CHI, the segmentation information ASI and the language information LI for the first frame for the first time. The third language-property recognition means 22 also have a fourth initial processing delay D4 of their own, in which case the speaker group information SGI for the first frame can be generated for the first time after this processing delay D4 has elapsed and can be transmitted to the recognition means 23 and 24, which is indicated by a single arrow 37 that takes the place of a further cluster of arrows that is not shown in FIG. 10.

Following the processing delay D4, updated speaker group information SGI is continuously generated or made available by the third language-property recognition means 22 for the further frames of the speech information SI that occur after its first frame, namely for each first frame of the respective part of the speech information SI, which they do while continuously taking into account the items of information CHI, ASI and LI corresponding to each frame of the given part of the speech information SI.

Starting at point in time T5B, the fourth language-property recognition means 23 begin the recognition for the first time of the context for the first frame, i.e. for the first sub-area of the first part of the speech information SI, doing so with a delay equal to the starting delay d5 and by using the feature vectors FV representing the first part of the speech information SI and while taking into account the channel information CHI, segmentation information ASI, language information LI and speaker group information SGI that have been assigned in each case to each frame in the first part of the speech information SI. The starting delay d5 corresponds in this case to the sum of the initial processing delays D1, D2, D3 and D4 caused by the means 18, 20, 21 and 22. Hence the fourth language-property recognition means 23 are arranged to recognize the context for the first frame with a delay of at least the time-spans that are required by the means 18, 20, 21 and 22 to generate the items of information CHI, ASI, LI and SGI for the first frame for the first time. The language-property recognition means 23 also have a fifth initial processing delay D5 of their own, in which case the context or topic information CI for the first frame of the speech information SI can be generated for the first time after this processing delay D5 has elapsed and can be transmitted to the speech recognition means 24, which is indicated by an arrow 38.

Following the processing delay D5, updated context or topic information CI is continuously generated or made available by the fourth language-property recognition means 23 for the further frames of the speech information SI that occur after its first frame, namely for each first frame of the respective part of the speech information SI, which they do while continuously taking into account the items of information CHI, ASI, LI and SGI corresponding to each frame of the given part of the speech information SI.

Starting at point in time T6B, the speech recognition means 24 begin the recognition for the first time of the text information TI for the first frame, i.e. for the first sub-area of the first part of the speech information SI, doing so with a delay equal to the starting delay d6 and by using the feature vectors FV representing the first part of the speech information SI and while taking into account the channel information CHI, segmentation information ASI, language information LI, speaker group information SGI and context or topic information CI that have been assigned in each case to each frame in the first part of the speech information SI. The starting delay d6 corresponds in this case to the sum of the initial processing delays D1, D2, D3, D4 and D5 caused by the means 18, 20, 21, 22 and 23. Hence the recognition means 24 are arranged to recognize the text information TI for the first frame of the speech information SI for the first time with a delay of at least the time-spans that are required by the means 18, 20, 21, 22 and 23 to generate the items of information CHI, ASI, LI, SGI and CI for the first frame for the first time. The speech recognition means 24 also have an initial processing delay D6 of their own, in which case the text information TI for the first frame of the speech information SI can be generated for the first time after this processing delay D6 has elapsed and can be transmitted to the means 25, 26 and 27.

Following the processing delay D6, updated text information TI is continuously generated or made available by the speech recognition means 24 for the further frames of the speech information SI that occur after its first frame, namely for each first frame of the respective part of the speech information SI, which they do while continuously taking into account the items of information CHI, ASI, LI, SGI and CI corresponding to each frame of the given part of the speech information SI.

Summarizing, it can be said in connection with the activities over time that a frame is processed by one of the recognition stages 20, 21, 22, 23 or 24 whenever all the items of information CHI, ASI, LI, SGI or CI required by the given recognition stage 20, 21, 22, 23 or 24 for processing the given frame are available at the given recognition stage 20, 21, 22, 23 or 24.
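This availability rule can be captured in a short sketch; the dependency table mirrors FIG. 1, while the per-frame bookkeeping is an assumed simplification that ignores the processing delays D1 to D6:

```python
# Which items of information each item depends on, per frame (cf. FIG. 1).
REQUIRES = {
    "CHI": [],                                 # reception channel (18)
    "ASI": ["CHI"],                            # acoustic segmentation (20)
    "LI":  ["CHI", "ASI"],                     # language (21)
    "SGI": ["CHI", "ASI", "LI"],               # speaker group (22)
    "CI":  ["CHI", "ASI", "LI", "SGI"],        # context or topic (23)
    "TI":  ["CHI", "ASI", "LI", "SGI", "CI"],  # text information (24)
}

available = {info: set() for info in REQUIRES}  # frame indices produced

def process(frame):
    # A stage processes a frame only once all items of information it
    # requires are available for that frame; dict insertion order ensures
    # that dependencies are visited before their dependents.
    for info, deps in REQUIRES.items():
        if all(frame in available[dep] for dep in deps):
            available[info].add(frame)          # stand-in for recognition

for frame in range(3):
    process(frame)
assert all(len(produced) == 3 for produced in available.values())
```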

In the light of the above exposition, the speech recognition device 1 is arranged to perform a speech recognition method for recognizing text information TI corresponding to speech information SI, it being possible for the speech information SI to be characterized in respect of its language properties, namely the acoustic segmentation, the language, the speaker group and the context or topic. The speech recognition method has the method steps listed below, namely recognition of the acoustic segmentation by using the speech information SI, generation of segmentation information ASI representing the acoustic segmentation recognized, recognition of the language by using the speech information SI, generation of language information LI representing the language recognized, recognition of the speaker group by using the speech information SI, generation of speaker group information SGI representing the speaker group recognized, recognition of the context or topic by using the speech information SI, generation of context or topic information CI representing the context or topic recognized, and recognition of the text information TI corresponding to the speech information SI while taking continuous account of the segmentation information ASI, the language information LI, the speaker group information SGI and the context information CI, the generation of the items of information ASI, LI, SGI and CI, and in particular the way in which account is taken of the items of information CHI, ASI, LI and SGI that are required for this purpose in the respective cases, being considered in detail below.

What is also done in the speech recognition method is that the speech information SI is received and, by using the audio signal AS that is characteristic of one of the four reception channels 3, 5, 6 and 7, the reception channel being used at the time to receive the speech information SI is recognized, an item of channel information CHI which represents the reception channel 3, 5, 6 or 7 that is recognized is generated, and the channel information CHI is taken into account in the recognition of the acoustic segmentation, the language, the speaker group, the context and the text information TI, the recognition of the reception channel 3, 5, 6 or 7 taking place continuously, that is to say frame by frame, for, in each case, the first frame of the given part of the speech information SI, and, correspondingly thereto, the channel information being continuously updated, i.e. regenerated, and being taken into account continuously too.

What also occurs in the speech recognition method is that the recognition of the acoustic segmentation is performed while taking into account the channel information CHI corresponding to each frame of the respective part of the speech information SI. The recognition of the acoustic segmentation for the first frame of the given part of the speech information SI takes place in this case with a delay of at least the time-span required for the generation of the channel information CHI, during which time-span the given part of the speech information SI can be used to generate the channel information CHI for the first frame of the given part. A further delay is produced by the second processing delay D2 caused by the first language-property recognition means 20. Following this, the acoustic segmentation is updated frame by frame.

What also occurs in the speech recognition method is that the recognition of the language is performed while taking into account, in addition, the segmentation information ASI corresponding to each frame of the given part of the speech information SI. The recognition of the language for the first frame of the given part of the speech information SI takes place in this case with a delay of at least the time-spans required for the generation of the channel information CHI and the segmentation information ASI, during which time-spans the given part of the speech information SI can be used to generate the two items of information CHI and ASI for the first frame of the given part. A further delay is produced by the third processing delay D3 caused by the second language-property recognition means 21. Following this, the language is updated frame by frame.

What also occurs in the speech recognition method is that the recognition of the speaker group is performed while taking into account, in addition, the segmentation information ASI and language information LI corresponding to each frame of the given part of the speech information SI. The recognition of the speaker group for the first frame of the given part of the speech information SI takes place in this case with a delay of at least the time-spans required for the generation of the channel information CHI, the segmentation information ASI and the language information LI, during which time-spans the given part of the speech information SI can be used to generate the items of information CHI, ASI and LI for the first frame of the given part. A further delay is produced by the fourth processing delay D4 caused by the third language-property recognition means 22. Following this, the speaker group is updated frame by frame.

What also occurs in the speech recognition method is that the recognition of the context or topic is performed while taking into account, in addition, the segmentation information ASI, language information LI and speaker group information SGI corresponding to each frame of the given part of the speech information SI. The recognition of the context or topic for the first frame of the given part of the speech information SI takes place in this case with a delay of at least the time-spans required for the generation of the items of information CHI, ASI, LI and SGI, during which time-spans the given part of the speech information SI can be used to generate the items of information CHI, ASI, LI and SGI for the first frame of the given part. A further delay is produced by the fifth processing delay D5 caused by the fourth language-property recognition means 23. Following this, the context or topic is updated frame by frame.

What also occurs in the speech recognition method is that, while taking into account the CHI, ASI, LI, SGI and CI information corresponding to each frame of the given part of the speech information SI, the recognition of the text information TI corresponding to the speech information SI is performed for the first frame of the given part of the speech information SI with a delay of at least the time-spans required for the generation of the channel information CHI, the segmentation information ASI, the language information LI, the speaker group information SGI and the context or topic information CI, during which time-spans the given part of the speech information SI can be used to generate the items of information CHI, ASI, LI, SGI and CI for the first frame of the given part. A further delay is produced by the sixth processing delay D6 caused by the speech recognition means 24. Following this, the text information TI is updated frame by frame.

The speech recognition method is performed with the computer 1A when the computer program product is run on the computer 1A. The computer program product is stored on a computer-readable medium that is not shown in FIG. 1, which medium is formed in the present case by a compact disk (CD). It should be mentioned at this point that a DVD, a tape-like data carrier or a hard disk may be provided as the medium. In the present case the computer has as its processing unit a single microprocessor. It should however be mentioned that, for reasons of performance, a plurality of microprocessors may also be provided, such for example as a dedicated microprocessor for each of the recognition means 18, 20, 21, 22, 23 and 24. The internal memory 1B of the computer 1A is formed in the present case by a combination of a hard disk (not shown in FIG. 1) and working memory 39 formed by what are termed RAMs, which means that the computer program product can first be stored onto the hard disk from the computer-readable medium and can be loaded into the working memory 39 for running by means of the processing unit, as will be sufficiently familiar to the person skilled in the art. The memory 1B is also arranged to store the preprocessed audio signal PAS and the items of information CHI, ASI, LI, SGI and CI and to store items of temporal correlation data (not shown in FIG. 1). The items of temporal correlation data represent a temporal correlation between the sub-areas of the speech information SI and the items of information CHI, ASI, LI, SGI and CI that respectively correspond to these sub-areas, to enable the acoustic segmentation, the language, the speaker group, the context or topic and the text information TI for the given sub-area of the speech information SI to be recognized with the correct temporal synchronization.

What is achieved in an advantageous way by the provision of the features according to the invention is that the speech recognition device 1 or the speech recognition method can be used for the first time in an application in which a plurality of language properties characteristic of the speech information SI are simultaneously subject to a change occurring at substantially random points in time. An application of this kind exists in the case of, for example, a conference transcription system, where speech information SI produced by random conference participants has to be converted into text information TI continuously and approximately in real time, in which case the conference participants, in a conference room, supply the speech information SI to the speech recognition device 1 via the first reception channel 3 by means of the audio signal AS. The conference participants may use different languages in this case and may belong to different individual speaker groups. Also, circumstances may occur during a conference, such as background noise for example, which affect the acoustic segmentation. Also, the context or topic being used at the time may change during the conference. What also becomes possible in an advantageous way is for conference participants who are not present in the conference room also to supply the speech information SI associated with them to the speech recognition device 1, via further reception channels 5, 6 and 7. Even in this case, there is an assurance in the case of the speech recognition device 1 that the text information TI will be reliably recognized, because the reception channel 3, 5, 6 or 7 being used in the given case is recognized and continuous account is taken of it in the recognition of the language properties, i.e. in the generation and updating of the items of information CHI, ASI, LI, SGI and CI.

An application of this kind also exists when, at a call center for example, a record is to be kept of calls by random persons, who may be using different languages.

An application of this kind also exists when, in the case of an automatic telephone information service for example, callers of any desired kinds are to be served. It should be expressly made clear at this point that the applications that have been cited here do not represent a full and complete enumeration.

The feature-vector extraction means 19 shown in FIG. 3 have a pre-emphasis stage 40 that is arranged to receive the audio signal AS and to emit a modified audio signal AS″ representing the audio signal AS, higher frequencies being emphasized in the modified audio signal AS″ to level out the frequency response. Also provided is a frame-blocking stage 41 that is arranged to receive the modified audio signal AS″ and to emit parts of the modified audio signal AS″ that are embedded in frames F. The adjacent frames F of the audio signal AS″ have a temporal overlap in their edge regions in this case. Also provided is a windowing stage 42 that is arranged to receive the frames F and to generate modified frames F′ representing the frames F, which modified frames F′ are limited in respect of the bandwidth of the audio signal represented by the frames F, to avoid unwanted effects at a subsequent conversion to the spectral level. A so-called Hamming window is used in the windowing stage in the present case. It should however be mentioned that other types of window may be used as well. Also provided is a fast Fourier transformation stage 43 that is arranged to receive the modified frames F′ and to generate first vectors V1 on the spectral level corresponding to the bandwidth-limited audio signal AS″ contained in the modified frames F′, a so-called “zero-padding” method being used in the present case. Also provided is a logarithmic filter bank stage 44 that is arranged to receive the first vectors V1 and the channel information CHI and, using the first vectors V1 and while taking into account the channel information CHI, to generate and emit second vectors V2, the second vectors V2 representing a logarithmic mapping of intermediate vectors that can be generated from the first vectors V1 by a filter bank method.
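
As a hedged illustration of the chain formed by the stages 40 to 43, the following Python sketch uses common front-end defaults (a 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, a Hamming window and a zero-padded 512-point FFT); the embodiment fixes none of these values.

# A minimal front-end sketch of the stages 40-43 under assumed,
# commonly used parameter choices.
import numpy as np

def spectral_vectors(audio, sr=16000, frame_ms=25, hop_ms=10, nfft=512):
    # Pre-emphasis stage 40: emphasize higher frequencies.
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # Frame-blocking stage 41: overlapping frames F.
    flen = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - flen) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + flen]
                       for i in range(n_frames)])
    # Windowing stage 42: Hamming window against spectral leakage.
    frames = frames * np.hamming(flen)
    # FFT stage 43: zero-padding to nfft, magnitude spectrum (vectors V1).
    return np.abs(np.fft.rfft(frames, n=nfft))

v1 = spectral_vectors(np.random.randn(16000))  # 1 s of dummy audio
print(v1.shape)  # (n_frames, nfft // 2 + 1)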

The logarithmic filter bank stage 44 that is shown in FIG. 12 has a filter-bank parameter pool stage 44A that stores a pool of filter-bank parameters. Also provided is a filter parameter selecting stage 44B that is arranged to receive the channel information CHI and to select filter-bank parameters FP corresponding to the channel information CHI. Also provided is what is termed a logarithmic filter-bank core 44C that is arranged to process the first vectors V1 and to generate the second vectors V2 as a function of the filter-bank parameters FP receivable from the filter parameter selecting stage 44B.
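
A minimal sketch of this channel-dependent selection follows; the channel names and filter-bank parameter values are invented for illustration, and the mel-style filter bank merely stands in for whatever filter bank method the stage 44C actually applies.

# Stages 44A-44C as a sketch: a pool of filter-bank parameters, a
# selection keyed by the channel information CHI, and a log filter bank.
import numpy as np

FILTER_BANK_POOL = {               # stage 44A: pool of parameters
    "microphone": dict(n_mels=30, fmax=8000.0),
    "telephone":  dict(n_mels=24, fmax=3400.0),  # narrow-band channel
}

def select_params(chi: str) -> dict:             # stage 44B
    return FILTER_BANK_POOL[chi]

def log_filter_bank(v1, sr, nfft, n_mels, fmax):  # stage 44C
    """Triangular filter bank on a mel-like scale, then log (vectors V2)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(0.0, mel(fmax), n_mels + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(v1 @ fb.T + 1e-10)

v1 = np.abs(np.fft.rfft(np.random.randn(98, 400), n=512))  # dummy V1
v2 = log_filter_bank(v1, sr=16000, nfft=512, **select_params("telephone"))
print(v2.shape)  # (98, 24)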

The feature-vector extraction means 19 shown in FIG. 3 also have a first normalizing stage 45 that is arranged to receive the second vectors V2 and to generate and emit third vectors V3 that are free of means in respect of the amplitude of the second vectors V2. This ensures that further processing is possible irrespective of the particular reception channel involved. Also provided is a second normalizing stage 46 that is arranged to receive the third vectors V3 and, while taking into account the temporal variance applicable to each of the components of the third vectors V3, to generate fourth vectors V4 that are normalized in respect of the temporal variance of the third vectors V3. Also provided is a discrete cosine transformation stage 47 that is arranged to receive the fourth vectors V4 and to convert the fourth vectors V4 to the so-called “cepstral” level and to emit fifth vectors V5 that correspond to the fourth vectors V4. Also provided is a feature-vector generating stage 48 that is arranged to receive the fifth vectors V5 and to generate the first and second time derivatives of the fifth vectors V5, which means that the vector representation of the audio signal AS in the form of the feature vectors FV, which representation can be emitted by the feature-vector generating stage 48, has the fifth vectors V5 on the “cepstral” level and the time derivatives corresponding thereto.
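
Assuming standard cepstral processing and the availability of SciPy, the stages 45 to 48 might be sketched as follows; the choice of 13 cepstral coefficients and the use of np.gradient for the time derivatives are assumptions of this sketch.

# Stages 45-48 as a sketch: mean normalization, variance normalization,
# DCT to the cepstral level, and first/second time derivatives appended.
import numpy as np
from scipy.fftpack import dct

def feature_vectors(v2, n_ceps=13):
    v3 = v2 - v2.mean(axis=0)                 # stage 45: mean-free
    v4 = v3 / (v3.std(axis=0) + 1e-10)        # stage 46: variance norm
    v5 = dct(v4, type=2, axis=1, norm="ortho")[:, :n_ceps]  # stage 47
    d1 = np.gradient(v5, axis=0)              # stage 48: 1st derivative
    d2 = np.gradient(d1, axis=0)              #           2nd derivative
    return np.hstack([v5, d1, d2])            # feature vectors FV

fv = feature_vectors(np.random.randn(98, 24))
print(fv.shape)  # (98, 39)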

The reception-channel recognition means 18 shown in FIG. 4 have at the input end a spectral-vector extraction stage 49 that is arranged to receive the audio signal AS and to extract and emit spectral vectors V6, which spectral vectors V6 represent the audio signal AS on the spectral level. The reception-channel recognition means 18 further have a bandwidth-limitation recognition stage 50 that is arranged to receive the spectral vectors V6 and, by using the spectral vectors V6, to recognize a limitation of the frequency band of the audio signal AS, the bandwidth limitation found in the particular case being representative of one of the four reception channels. The bandwidth-limitation recognition stage 50 is also arranged to emit an item of bandwidth-limitation information BWI that represents the bandwidth limitation recognized. The reception-channel recognition means 18 further have a channel classifying stage 51 that is arranged to receive the bandwidth-limitation information BWI and, by using this information BWI, to classify the reception channel that is current at the time and to generate the channel information CHI corresponding thereto.
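
The following sketch illustrates one plausible reading of the stages 49 to 51: the bandwidth is taken to be the highest frequency carrying appreciable energy, and a cut-off threshold maps it to a channel class. The threshold value and channel names are assumptions; the embodiment distinguishes four reception channels, of which only two are sketched here.

# A sketch of bandwidth-limitation recognition (stage 50) and channel
# classification (stage 51) from spectral vectors V6 (stage 49).
import numpy as np

def bandwidth_hz(v6, sr=16000, rel_floor=1e-3):
    """Highest frequency whose average energy exceeds a relative floor
    (a stand-in for the bandwidth-limitation information BWI)."""
    mean_spec = v6.mean(axis=0)
    mask = mean_spec > rel_floor * mean_spec.max()
    top_bin = int(np.max(np.nonzero(mask)))
    return top_bin * (sr / 2) / (len(mean_spec) - 1)

def classify_channel(bwi_hz):                   # stage 51
    # The embodiment distinguishes four channels; two are sketched here.
    return "telephone" if bwi_hz <= 3600.0 else "microphone"

v6 = np.abs(np.fft.rfft(np.random.randn(50, 400), n=512))
print(classify_channel(bandwidth_hz(v6)))       # channel information CHI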

The first language-property recognition means 20 shown in FIG. 5 have a speech-pause recognition stage 52, a non-speech recognition stage 53 and a music recognition stage 54, to each of which recognition stages 52, 53 and 54 the feature vectors FV can be fed. The speech-pause recognition stage 52 is arranged to recognize feature vectors FV representing pauses in speech and to emit an item of speech-pause information SI representing the result of the recognition. The non-speech recognition stage 53 is arranged to receive the channel information CHI and, while taking the channel information CHI into account, to recognize feature vectors FV representing non-speech and to emit an item of non-speech information NSI representing non-speech. The music recognition stage 54 is arranged to receive the channel information CHI and, while taking the channel information CHI into account, to recognize feature vectors FV representing music and to generate and emit an item of music information MI representing the recognition of music. The first language-property recognition means 20 further have an information analyzing stage 55 that is arranged to receive the speech-pause information SI, the non-speech information NSI and the music information MI. The information analyzing stage 55 is further arranged to analyze the items of information SI, NSI and MI and, as a result of the analysis, to generate and emit the segmentation information ASI, the segmentation information ASI stating whether the frame of the audio signal AS that is represented at the time by the feature vectors FV is associated with a pause in speech or non-speech or music, and, if the given frame is not associated either with a pause in speech, or with non-speech or with music, stating that the given frame is associated with speech.
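
The decision logic of the information analyzing stage 55 reduces, in a minimal sketch, to merging the three per-frame flags into one segmentation label, with speech as the default when no flag is set:

# Stage 55 as a sketch: per frame, the pause, non-speech and music flags
# are merged into one segmentation label (the segmentation information ASI).
def segmentation_label(pause: bool, non_speech: bool, music: bool) -> str:
    if pause:
        return "speech_pause"
    if non_speech:
        return "non_speech"
    if music:
        return "music"
    return "speech"   # no flag set: the frame is associated with speech

asi = [segmentation_label(p, n, m)
       for p, n, m in [(True, False, False), (False, False, False)]]
print(asi)  # ['speech_pause', 'speech']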

The music recognition stage 54 that is shown in detail in FIG. 13 is arranged to recognize music in a trainable manner and for this purpose is arranged to receive segmentation training information STI. The music recognition stage 54 has a classifying stage 56 that, with the help of two groups of so-called “Gaussian mixture models”, is arranged to classify the feature vectors FV into feature vectors FV representing music and feature vectors FV representing non-music. Each first Gaussian mixture model GMM1 belonging to the first group is assigned to a music classification and each second Gaussian mixture model GMM2 belonging to the second group is assigned to a non-music classification. The classifying stage 56 is also arranged to emit the music information MI as a result of the classification. The music recognition stage 54 further has a first model selecting stage 57 and a first model storage stage 58. For each of the reception channels, the first model storage stage 58 is arranged to store a Gaussian mixture model GMM1 assigned to the music classification and a Gaussian mixture model GMM2 assigned to the non-music classification. The first model selecting stage 57 is arranged to receive the channel information CHI and, with the help of the channel information CHI, to select a pair of Gaussian mixture models GMM1 and GMM2 which correspond to the reception channel stated in the given case, and to transmit the Gaussian mixture models GMM1 and GMM2 selected in this channel-specific manner to the classifying stage 56.
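
A minimal sketch of the stages 56 to 58 follows, using scikit-learn Gaussian mixture models as a stand-in; the model sizes, the dummy training data and the two channel names are illustrative assumptions.

# Stages 56-58 as a sketch: per reception channel one music model GMM1
# and one non-music model GMM2 are kept (stage 58) and selected by the
# channel information CHI (stage 57); stage 56 compares likelihoods.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
model_storage = {}                                  # stage 58
for channel in ("telephone", "microphone"):
    gmm1 = GaussianMixture(n_components=4, random_state=0)  # music
    gmm2 = GaussianMixture(n_components=4, random_state=0)  # non-music
    gmm1.fit(rng.normal(2.0, 1.0, size=(200, 13)))  # dummy training data
    gmm2.fit(rng.normal(-2.0, 1.0, size=(200, 13)))
    model_storage[channel] = (gmm1, gmm2)

def classify_music(fv, chi):                        # stages 56 and 57
    gmm1, gmm2 = model_storage[chi]                 # channel-specific pair
    # Compare average log-likelihood under the music and non-music models.
    is_music = gmm1.score(fv) > gmm2.score(fv)
    return "music" if is_music else "non_music"     # music information MI

print(classify_music(rng.normal(2.0, 1.0, size=(30, 13)), "microphone"))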

The music recognition stage 54 is further arranged to train the Gaussian mixture models, and for this purpose it has a first training stage 59 and a first data-stream control stage 60. In the course of the training, feature vectors FV that, in a predetermined way, each belong to a single class, namely music or non-music, can be fed to the first training stage 59 with the help of the data-stream control stage 60. The training stage 59 is also arranged to train the channel-specific pairs of Gaussian mixture models GMM1 and GMM2. The first model selecting stage 57 is arranged to transmit the Gaussian mixture models GMM1 and GMM2 to the storage locations intended for them in the first model storage stage 58, with the help of the channel information CHI and the segmentation training information STI.

The second language-property recognition means 21 shown in FIG. 6 have at the input end a first speech filter stage 61 that is arranged to receive the feature vectors FV and the segmentation information ASI and, by using the feature vectors FV and the segmentation information ASI, to filter out feature vectors FV representing speech and to emit the feature vectors FV representing speech. The second language-property recognition means 21 further have a second model storage stage 62 that is arranged and intended to store a multi-language first phoneme model PM1 for each of the four reception channels. The recognition means 21 further have a second model selecting stage 63 that is arranged to receive the channel information CHI and, by using the channel information CHI, to access, in the second model storage stage 62, the multi-language phoneme model PM1 that corresponds to the reception channel stated by the channel information CHI and to emit the channel-specific multi-language phoneme model PM1 that has been selected in this way. The recognition means 21 further have a phoneme recognition stage 64 that is arranged to receive the feature vectors FV representing speech and the phoneme model PM1 and, by using the feature vectors FV and the phoneme model PM1, to generate and emit a phonetic transcription PT of the language represented by the feature vectors FV. The recognition means 21 further have a third model storage stage 65 that is arranged and intended to store a phonotactic model PTM for each language. The recognition means 21 further have a second classifying stage 66 that is arranged to access the third model storage stage 65 and, with the help of the phonotactic model PTM, to classify the phonetic transcription PT phonotactically, the probability of a language being present being determinable for each available language. The second classifying stage 66 is arranged to generate and emit the language information LI as a result of the determination of the probability corresponding to each language, the language information LI giving the language for which the probability found was the highest.
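
A hedged sketch of the phonotactic classification performed by the stages 64 to 66: a phonetic transcription PT is scored under one phonotactic bigram model per language, and the best-scoring language forms the language information LI. The toy bigram log-probabilities below are invented, not trained resources.

# Stages 65-66 as a sketch: per-language phonotactic bigram scoring of a
# phonetic transcription PT produced by the phoneme recognition stage 64.
PTM = {   # stage 65: log-probability of phoneme bigrams, per language
    "english": {("dh", "ax"): -0.5, ("ax", "k"): -1.2, ("k", "ae"): -1.0},
    "german":  {("dh", "ax"): -3.0, ("ax", "k"): -0.8, ("k", "ae"): -2.5},
}
FLOOR = -5.0   # log-probability assumed for unseen bigrams

def phonotactic_score(pt, model):
    return sum(model.get(bigram, FLOOR) for bigram in zip(pt, pt[1:]))

def classify_language(pt):                          # stage 66
    scores = {lang: phonotactic_score(pt, m) for lang, m in PTM.items()}
    return max(scores, key=scores.get)              # language information LI

pt = ["dh", "ax", "k", "ae"]   # transcription PT from stage 64 (assumed)
print(classify_language(pt))   # 'english'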

The recognition means 21 can also be acted on in a trainable way in respect of the recognition of language and for this purpose have a second data-stream control stage 67, a third data-stream control stage 68, a second training stage 69 and a third training stage 70. In the event of training, the feature vectors FV representing speech can be fed to the second training stage 69 with the help of the second data-stream control stage 67. The second training stage 69 is arranged to receive these feature vectors FV, to receive training text information TTI and to receive the channel information CHI, in which case a phonetic transcription made from the training text information TTI corresponds to the language represented by the feature vectors FV. Hence, by using the feature vectors FV and the training text information TTI, the second training stage 69 is arranged to train the phoneme model PM1 and to transmit the trained phoneme model PM1 to the model selecting stage 63. The model selecting stage 63 is further arranged, with the help of the channel information CHI, to transmit the trained phoneme model PM1 to the second model storage stage 62, where it can be stored at a storage location in said second model storage stage 62 that corresponds to the channel information CHI.

In the event of training, the phonetic transcription PT able to be made by the phoneme recognition stage 64 can also be fed to the third training stage 70 with the help of the third data-stream control stage 68. The third training stage 70 is arranged to receive the phonetic transcription PT, to train a phonotactic model PTM assigned to the given training language information TLI and to transmit it to the third model storage stage 65. The third model storage stage 65 is arranged to store the phonotactic model PTM belonging to a language at a storage location corresponding to the training language information TLI. It should be mentioned at this point that the models PM1 and PTM stored in the second model storage stage 62 and the third model storage stage 65 are referred to in the specialist jargon as trainable resources.

The second training stage 69 is shown in detail in FIG. 14 and has a fourth model storage stage 71, a third model selecting stage 72, a model grouping stage 73, a model aligning stage 74 and a model estimating stage 75. The fourth model storage stage 71 is arranged and intended to store a channel-specific and language-specific initial phoneme model IPM for each channel and each language. The third model selecting stage 72 is arranged to access the fourth model storage stage 71 and to receive the channel information CHI and, by using the channel information CHI, to read out the initial phoneme model IPM corresponding to the channel information CHI, for all languages. The third model selecting stage 72 is further arranged to transmit a plurality of language-specific phoneme models IPM corresponding to the given channel to the model grouping stage 73. The model grouping stage 73 is arranged to group together language-specific phoneme models IPM that are similar to one another and belong to different languages and to generate an initial multi-language phoneme model IMPM and to transmit it to the model aligning stage 74. The model aligning stage 74 is arranged to receive the feature vectors FV representing speech and the training text information TTI corresponding thereto and, with the help of the initial multi-language phoneme model IMPM, to generate items of alignment information RE that are intended to align the feature vectors FV with sections of text represented by the training text information TTI, the items of alignment information RE also being referred to in the specialist jargon as “paths”. The items of alignment information RE and the feature vectors FV can be transmitted to the model estimating stage 75 by the model aligning stage 74. The model estimating stage 75 is arranged, by using the items of alignment information RE and the feature vectors FV, to generate the multi-language phoneme model PM1 based on the initial multi-language phoneme model IMPM and to transmit it to the second model storage stage 62 shown in FIG. 6. For this purpose and using the feature vectors FV and the alignment information RE, a temporary multi-language phoneme model TMPM is generated and transmitted to the model aligning stage 74, the multi-language phoneme model PM1 being generated in a plurality of iterative stages, i.e. by repeated co-operation of the stages 74 and 75.
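
Schematically, and with align() and estimate() as placeholders for the actual forced-alignment and estimation procedures, the iterative co-operation of the stages 74 and 75 might look like this:

# A schematic of the iterative loop between the model aligning stage 74
# and the model estimating stage 75: alignment "paths" RE are computed
# under the current model, a new model is estimated from the aligned
# data, and the loop repeats until PM1 is obtained.
def train_multilanguage_model(impm, feature_vectors, training_text,
                              align, estimate, iterations=5):
    model = impm                      # initial multi-language model IMPM
    for _ in range(iterations):
        re = align(model, feature_vectors, training_text)  # stage 74
        model = estimate(re, feature_vectors)              # stage 75
    return model                      # multi-language phoneme model PM1

# Trivial placeholders, just to make the loop runnable:
pm1 = train_multilanguage_model(
    impm={"means": 0.0},
    feature_vectors=[0.9, 1.1, 1.0],
    training_text="...",
    align=lambda m, fv, tt: fv,                        # dummy "path"
    estimate=lambda re, fv: {"means": sum(re) / len(re)},
)
print(pm1)  # {'means': 1.0}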

The third language-property recognition means 22 shown in FIG. 7 have at the input end a second speech filter stage 76 that is arranged to receive the feature vectors FV and the segmentation information ASI and, by using the segmentation information ASI, to filter out and emit feature vectors FV representing speech. The recognition means 22 also have a fifth model storage stage 77 that is arranged and intended to store speaker group models SGM for each channel and each language. The recognition means 22 further have a fourth model selecting stage 78 that is arranged to receive the channel information CHI and the language information LI and, by using the channel information CHI and the language information LI, to access the given speaker group model SGM that corresponds to the given channel information CHI and the given language information LI. The fourth model selecting stage 78 is also arranged to transmit the speaker group model SGM that can be read out as a result of the access to the fifth model storage stage 77. The recognition means 22 further have a third classifying stage 79 that is arranged to receive the speaker group model SGM selected as a function of the items of information CHI and LI by the fourth model selecting stage 78 and to receive the feature vectors FV representing speech and, with the help of the speaker group model SGM selected, to classify the speaker group to which the feature vectors FV can be assigned. The third classifying stage 79 is further arranged to generate and emit the speaker group information SGI as a result of the classification.
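
A minimal sketch of the stages 77 to 79: the speaker group models SGM are keyed by channel and language, and a nearest-mean comparison stands in for the real classifier; all keys and values are illustrative assumptions.

# Stages 77-79 as a sketch: selection of the SGM by (CHI, LI), then a
# nearest-mean classification of the incoming feature vectors.
import numpy as np

sgm_storage = {   # stage 77: one set of group models per (CHI, LI)
    ("telephone", "english"): {"adult": np.zeros(13), "child": np.ones(13)},
}

def classify_speaker_group(fv, chi, li):
    groups = sgm_storage[(chi, li)]               # stage 78: selection
    mean = fv.mean(axis=0)
    # Stage 79: nearest group mean stands in for the real classifier.
    return min(groups, key=lambda g: np.linalg.norm(mean - groups[g]))

fv = np.random.default_rng(1).normal(0.1, 0.2, size=(40, 13))
print(classify_speaker_group(fv, "telephone", "english"))  # 'adult' (SGI)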

By means of the fifth model storage stage 77, a further trainable resource is implemented, the speaker group models SGM stored therein being alterable in a trainable manner. For this purpose, the recognition means 22 have a fourth training stage 80 and a fourth data-stream control stage 81. In the event of training, feature vectors FV representing the language can be fed to the fourth training stage 80 with the help of the fourth data-stream control stage 81. For a number of speakers, the fourth training stage 80 is arranged to receive feature vectors FV assigned to respective ones of the speakers and the training text information TTI corresponding to each of the feature vectors FV, to train the given speaker group model SGM and to transmit the given trained speaker group model SGM to the fourth model selecting stage 78.

The fourth training stage 80 that is shown in detail in FIG. 15 has a sixth model storage stage 82, a fifth model selecting stage 83, a model adaption stage 84, a buffer storage stage 85 and a model grouping stage 86. The sixth model storage stage 82 is arranged and intended to store speaker-independent phoneme models SIPM for each channel and each language. The fifth model selecting stage 83 is arranged to receive the channel information CHI and the language information LI and, by using these two items of information CHI and LI, to access the sixth model storage stage 82, or rather the initial speaker-independent phoneme model SIPM corresponding to the given items of information CHI and LI, and to emit the speaker-independent phoneme model SIPM that has been selected and is now channel-specific and language-specific.

The model adaption stage 84 is arranged to receive the initial speaker-independent phoneme model SIPM that was selected in accordance with the channel information CHI and the language information LI and is thus channel-specific and language-specific, feature vectors FV representing the language, and the training text information TTI corresponding to the latter. For a plurality of speakers whose speech information SI is represented by the feature vectors FV, the model adaption stage 84 is further arranged to generate one speaker model SM each and to transmit it to the buffer storage stage 85, in which the given speaker model SM is storable. The speaker model SM is generated on the basis of the speaker-independent phoneme model SIPM by using an adaption process. Once the speaker models SM have been stored for the entire number of speakers, a grouping together of the plurality of speaker models into individual speaker group models SGM can be performed by means of the model grouping stage 86 in the light of similar speaker properties. The individual speaker group models SGM can be transmitted to the model selecting stage 78 and can be stored by the model selecting stage 78 in the model storage stage 77 by using the items of information CHI and LI.
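
A sketch of the grouping step of stage 86 follows, with per-speaker models reduced to mean vectors and k-means clustering assumed as a stand-in for the grouping method, which the embodiment does not specify:

# Stage 86 as a sketch: per-speaker models SM (here reduced to mean
# vectors) are clustered into speaker groups by similarity; each cluster
# centre then serves as one speaker group model SGM.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Buffer storage stage 85: one adapted speaker model SM per speaker.
speaker_models = np.vstack(
    [rng.normal(0.0, 0.1, size=(10, 13)).mean(0) for _ in range(8)] +
    [rng.normal(1.0, 0.1, size=(10, 13)).mean(0) for _ in range(8)])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(speaker_models)
sgm = kmeans.cluster_centers_      # one SGM per group of similar speakers
print(labels)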

The fourth language-property recognition means 23 that are shown in FIG. 8 have a stage 88 for recognizing keyword phoneme sequences, a keyword recognition stage 89 and a stage 90 for assigning keywords to a context or topic. The stage 88 is arranged to receive the feature vectors FV, to receive a second phoneme model PM2 that is channel-specific, language-specific and speaker-group-specific, and to receive keyword lexicon information KLI. The stage 88 is further arranged, by using the second phoneme model PM2 and the keyword lexicon information KLI, to recognize a keyword sequence represented by the feature vectors FV and to generate and emit keyword rating information KSI that represents a keyword that has been recognized and the probability with which it was recognized. The keyword recognition stage 89 is arranged to receive the keyword rating information KSI and to receive a keyword decision threshold value KWDT that is dependent on the reception channel, the language, the speaker group and the keyword. The stage 89 is further arranged, with the help of the keyword decision threshold value KWDT, to recognize which of the keywords received by means of the keyword rating information KSI were recognized. The keyword recognition stage 89 is arranged to generate keyword information KWI as a result of this recognition and to transmit said keyword information KWI to the stage 90 for assigning keywords to a context or topic. The stage 90 for assigning keywords to a topic is arranged to assign the keyword received with the help of the keyword information KWI to a context, which is often also referred to in the specialist jargon as a topic. The stage 90 for assigning keywords to a context or topic is arranged to generate the context information CI as a result of this assignment. The fourth language-property recognition means 23 further have a seventh model storage stage 91 that is arranged and intended to store the second phoneme model PM2 for each reception channel, each language and each speaker group. The recognition means 23 further have a sixth model selecting stage 92 that is arranged to receive the channel information CHI, the language information LI and the speaker group information SGI. The sixth model selecting stage 92 is further arranged, with the help of the channel information CHI, the language information LI and the speaker group information SGI, to select a second phoneme model PM2 stored in the seventh model storage stage 91 and to transmit the second phoneme model PM2 selected to the stage 88 for recognizing keyword phoneme sequences.

The recognition means 23 further have a keyword lexicon storage stage 93 and a language selecting stage 94. The keyword lexicon storage stage 93 is arranged and intended to store keywords for every language available. The language selecting stage 94 is arranged to receive the language information LI and to access the keyword lexicon storage stage 93, in which case, with the help of the language information LI, keyword lexicon information KLI that corresponds to the language information LI and represents the keywords in a language can be transmitted to the stage 88 for recognizing keyword phoneme sequences. The recognition means 23 further have a threshold-value storage stage 95 that is arranged and intended to store keyword decision threshold values KWDT that depend on the given reception channel, the language, the speaker group and the keyword. The recognition means 23 further have a threshold-value selecting stage 96 that is arranged to receive the channel information CHI, the language information LI and the speaker group information SGI. The threshold-value selecting stage 96 is further arranged to access the keyword decision threshold values KWDT, corresponding to the items of information CHI, LI and SGI, that are stored in the threshold-value storage stage 95. The threshold-value selecting stage 96 is further arranged to transmit the keyword decision threshold value KWDT that has been selected in this way to the keyword recognition stage 89.
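
Taken together, the threshold-value selection of stage 96 and the decision of stage 89 might be sketched as follows; the storage keys and all numeric values are illustrative assumptions.

# Stages 89, 95 and 96 as a sketch: a keyword counts as recognized only
# if its score exceeds the threshold KWDT selected for the current
# channel, language, speaker group and keyword.
kwdt_storage = {   # stage 95, keyed by (CHI, LI, SGI, keyword)
    ("telephone", "english", "adult", "invoice"): 0.62,
    ("telephone", "english", "adult", "account"): 0.55,
}

def recognized_keywords(ksi, chi, li, sgi):
    """ksi: keyword rating information, {keyword: probability}."""
    hits = []
    for keyword, score in ksi.items():
        kwdt = kwdt_storage[(chi, li, sgi, keyword)]   # stage 96
        if score >= kwdt:                              # stage 89
            hits.append(keyword)
    return hits          # keyword information KWI

ksi = {"invoice": 0.71, "account": 0.40}
print(recognized_keywords(ksi, "telephone", "english", "adult"))
# ['invoice']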

The recognition means 23 are further arranged to recognize the context or topic information CI in a trainable manner, two trainable resources being formed by the seventh model storage stage 91 and the threshold-value storage stage 95. The recognition means 23 further have a fifth training stage 97, a sixth training stage 98, a fifth data-stream control stage 99 and a sixth data-stream control stage 100. When the recognition means 23 are to be trained, the feature vectors FV can be fed to the fifth training stage 97 by means of the sixth data-stream control stage 100. The fifth training stage 97 is further arranged to receive the feature vectors FV and the training text information TTI corresponding thereto and, with the help of a so-called Viterbi algorithm, to generate one of the second phoneme models PM2 and transmit it to the sixth model selecting stage 92, as a result of which the second phoneme models PM2 are generated for each channel, each language and each speaker group. By means of the model selecting stage 92, the second phoneme models PM2 can be stored in the model storage stage 91 at storage locations that are determinable with the help of the items of information CHI, LI and SGI. By means of the fifth data-stream control stage 99, the keyword lexicon information KLI can also be fed to the sixth training stage 98. In a training process, the stage 88 for recognizing keyword phoneme sequences is arranged to recognize a phoneme sequence in feature vectors FV that represent the language, and to generate an item of phoneme rating information PSI representing the phoneme sequence that has been recognized and to transmit it to the sixth training stage 98, the phoneme rating information PSI representing the phonemes that have been recognized and, for each of them, the probability with which it was recognized.

The sixth training stage 98 is arranged to receive the phoneme rating information PSI and the keyword lexicon information KLI and, by using these two items of information PSI and KLI, to generate, i.e. to train, a keyword decision threshold value KWDT corresponding to the items of information CHI, LI and SGI and to transmit it to the threshold-value selecting stage 96. The threshold-value selecting stage 96 is arranged, by using the items of information CHI, LI and SGI, to transmit the keyword decision threshold value KWDT to the threshold-value storage stage 95. By means of the threshold-value selecting stage 96, the keyword decision threshold value KWDT can be stored at a storage location determined by means of the items of information CHI, LI and SGI.

The sixth training stage 98 shown in detail in FIG. 16 has a stage 101 for estimating phoneme distribution probabilities that is arranged to receive the phoneme rating information PSI and to estimate a statistical distribution for the phonemes spoken and the phonemes not spoken, on the assumption that a Gaussian distribution applies in each case. Stage 101 is thus arranged to generate and emit a first item of estimating information EI as a result of this estimating process. The sixth training stage 98 further has a stage 102 for estimating keyword probability distributions that is arranged to receive the first item of estimating information EI and the keyword lexicon information KLI. Stage 102 is further arranged, by using the two items of information KLI and EI, to estimate a statistical distribution for the keywords spoken and the keywords not spoken. Stage 102 is further arranged to generate and emit a second item of estimating information E2 as a result of this estimating process. The sixth training stage 98 further has a stage 103 for estimating keyword decision threshold values that, by using the second item of estimating information E2, is arranged to estimate the particular keyword decision threshold value KWDT and to emit the keyword decision threshold value KWDT as a result of this estimating process.
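
Assuming each score population is modelled by a single Gaussian, as the stages 101 to 103 suggest, one natural choice of threshold is the point at which the two estimated densities are equal; the following sketch computes that point from sampled scores (the sample parameters are invented):

# Stages 101-103 as a sketch: fit one Gaussian to the "spoken" and one
# to the "not spoken" score population, then place the keyword decision
# threshold KWDT at the equal-density point between the two means.
import numpy as np

def gaussian_threshold(pos_scores, neg_scores):
    m1, s1 = np.mean(pos_scores), np.std(pos_scores)
    m0, s0 = np.mean(neg_scores), np.std(neg_scores)
    # Equal log-density condition gives a quadratic a*x^2 + b*x + c = 0.
    a = 1 / (2 * s0**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m0 / s0**2
    c = m0**2 / (2 * s0**2) - m1**2 / (2 * s1**2) - np.log(s1 / s0)
    roots = np.roots([a, b, c]) if abs(a) > 1e-12 else [-c / b]
    # Keep the root lying between the two means.
    lo, hi = sorted((m0, m1))
    return next(r.real for r in np.atleast_1d(roots) if lo <= r.real <= hi)

rng = np.random.default_rng(3)
kwdt = gaussian_threshold(rng.normal(0.7, 0.08, 500),   # keywords spoken
                          rng.normal(0.4, 0.10, 500))   # not spoken
print(round(float(kwdt), 3))   # threshold near 0.56

With equal variances the formula reduces to the midpoint of the two means, which is a useful sanity check on the quadratic.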

The speech recognition means 24 shown in detail in FIG. 9 have at the input end a third speech filter stage 104 that is arranged to receive the feature vectors FV and to receive the segmentation information ASI and, by using the segmentation information ASI, to filter the feature vectors FV received and to emit feature vectors FV representing speech.

The recognition means 24 further have a speech pattern recognition stage 105 that is arranged to receive the feature vectors FV representing speech, to receive a third phoneme model PM3 and to receive context or topic data CD. The speech pattern recognition stage 105 is further arranged, by using the third phoneme model PM3 and the context data CD, to recognize a pattern in the feature vectors FV that represent speech and, as a result of recognizing a pattern of this kind, to generate and emit word graph information WGI. The word graph information WGI represents graphs of words or word sequences and their associated items of probability information that state the probability with which it is possible for the words or word sequences to occur in the particular language spoken.

The recognition means 24 further have a graph rating stage 106 that is arranged to receive the word graph information WGI and to find which path in the graph has the best word sequence in respect of the recognition of the text information TI. The graph rating stage 106 is further arranged to emit reformatted text information TI′ corresponding to the best word sequence as a result of the finding of this best word sequence.
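
A minimal sketch of the graph rating stage 106: the word graph information WGI is reduced to a small directed acyclic graph with log-probability edge weights, and the best word sequence is found by dynamic programming over topologically ordered nodes. The graph contents are invented for illustration.

# Stage 106 as a sketch: best-path search through a word graph.
WGI = {   # node -> list of (next_node, word, log_probability)
    0: [(1, "read", -0.4), (1, "red", -1.1)],
    1: [(2, "the", -0.2)],
    2: [(3, "report", -0.7), (3, "support", -1.5)],
    3: [],
}

def best_word_sequence(graph, start=0, end=3):
    best = {start: (0.0, [])}          # node -> (score, words so far)
    for node in sorted(graph):         # nodes assumed topologically sorted
        if node not in best:
            continue
        score, words = best[node]
        for nxt, word, logp in graph[node]:
            cand = (score + logp, words + [word])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best[end][1]                # reformatted text information TI'

print(" ".join(best_word_sequence(WGI)))   # 'read the report'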

The recognition means 24 further have a formatting storage stage 107 and a formatting stage 108. The formatting storage stage 107 is arranged to store formatting information FI, by means of which rules can be represented that state how the reformatted text information TI′ is to be formatted. The formatting stage 108 is arranged to receive the reformatted text information TI′ and to access the formatting storage stage 107 and read out the formatting information FI. The formatting stage 108 is further arranged, by using the formatting information FI, to format the reformatted text information TI′ and to generate and emit the text information TI as a result of the formatting.

The recognition means 24 further have an eighth model storage stage 109 that is arranged and intended to store a third phoneme model PM3 for each reception channel, each language and each speaker group. Also provided is a seventh model selecting stage 110 that is arranged to receive the channel information CHI, the language information LI and the speaker group information SGI. The seventh model selecting stage 110 is further arranged, by using the items of information CHI, LI and SGI, to access the third phoneme model PM3 corresponding to these items of information CHI, LI and SGI in the eighth model storage stage 109 and to read out this channel-specific, language-specific and speaker-group-specific third phoneme model PM3 to the speech pattern recognition stage 105. The recognition means 24 further have a context or topic storage stage 111. The context or topic storage stage 111 is intended to store the context or topic data CD, which context data CD represents lexicon information LXI, and a language model LM corresponding to the lexicon information LXI, for each item of context or topic information CI and each language. The context storage stage 111 has a lexicon storage area 113 in which the particular lexicon information LXI can be stored, which lexicon information LXI comprises words and phoneme transcriptions of the words. The context or topic storage stage 111 has a language model storage area 112 in which a language model LM corresponding to the given lexicon information LXI can be stored. The recognition means 24 further have a context or topic selecting stage 114 that is arranged to receive the context or topic information CI.

It should be mentioned at this point that the language information is not explicitly fed to the context selecting stage 114 because the context information implicitly represents the language.

The context or topic selecting stage 114 is further arranged, by using the context or topic information CI and the information on the given language implicitly represented thereby, to access the language model LM that, in the context storage stage 111, corresponds to the given context or topic information CI, and the lexicon information LXI, and to transmit the selected language model LM and the selected lexicon information LXI in the form of the context data CD to the speech pattern recognition stage 105.
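
A minimal sketch of the context or topic selecting stage 114: the context information CI keys a store holding, per topic, the lexicon LXI and the language model LM, which together form the context data CD; all entries shown are illustrative.

# Stages 111 and 114 as a sketch: CI selects the lexicon and language
# model of a topic; since each topic is tied to one language, CI also
# implicitly selects the language.
context_storage = {   # stage 111: CI -> (lexicon LXI, language model LM)
    "finance":  ({"invoice": "ih n v oy s"}, {"pay the invoice": -1.2}),
    "medicine": ({"dosage": "d ow s ih jh"}, {"check the dosage": -0.9}),
}

def select_context_data(ci):                         # stage 114
    lxi, lm = context_storage[ci]
    return {"lexicon": lxi, "language_model": lm}    # context data CD

cd = select_context_data("finance")
print(sorted(cd["lexicon"]))   # ['invoice']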

The speech recognition means 24 are further arranged to generate the third phoneme model PM3, the lexicon information LXI and each language model LM corresponding to a set of lexicon information LXI, in a trainable manner. In this connection, the eighth model storage stage 109 and the context storage stage 111 form trainable resources of the recognition means 24.

For the purpose of training the trainable resources, the recognition means 24 have a seventh data-stream control stage 115 and a seventh training stage 116. In the event of training, the seventh data-stream control stage 115 is arranged to transmit the feature vectors FV representing speech not to the speech pattern recognition stage 105 but to the seventh training stage 116. The seventh training stage 116 is arranged to receive the feature vectors FV representing speech and the training text information TTI corresponding thereto. The seventh training stage 116 is further arranged, by using the feature vectors FV and the training text information TTI and with the help of a Viterbi algorithm, to generate the given third phoneme model PM3 and transmit it to the seventh model selecting stage 110, thus enabling the third, trained phoneme model PM3, which corresponds to the channel information CHI, the language information LI or the speaker group information SGI, as the case may be, to be stored with the help of the seventh model selecting stage 110 in the eighth model storage stage 109 at a storage location defined by the items of information CHI, SGI and LI.

The recognition means 24 further have a language model training stage 117 that is arranged to receive a relatively large training text, which is referred to in the specialist jargon as a corpus and is represented by corpus information COR. The language model training stage 117 is arranged, by using the corpus information COR and with the help of the topic stated by the information CI and the lexicon information LXI determined by the language implicitly stated by the information CI, to train or generate the language model LM corresponding to each item of context or topic information CI and the language implicitly represented thereby, the lexicon information LXI determined in this way being able to be read out from the lexicon storage area 113 with the help of the context selecting stage 114 and to be transmitted to the language model training stage 117. The language model training stage 117 is arranged to transmit the trained language models LM to the context selecting stage 114, after which the language model LM is stored, by means of the context selecting stage 114 and by using the information CI, at the storage location in the language model storage area 112 that is intended for it.

The recognition means 24 further have a lexicon generating stage 118 that is likewise arranged to receive the corpus information COR and, by using the corpus information COR, to generate lexicon information LXI corresponding to each item of context information and to the language implicitly represented thereby and to transmit it to the context selecting stage 114, after which the lexicon information LXI is stored, with the help of the context selecting stage 114 and by using the information CI, at the storage location in the lexicon storage area 113 that is intended for it. For the purpose of generating the lexicon information LXI, the recognition means 24 have a background lexicon storage stage 119 that is arranged to store a background lexicon, which background lexicon contains a basic stock of words and associated phonetic transcriptions of words that, as represented by background transcription information BTI, can be emitted. The recognition means 24 further have a statistical transcription stage 120 that, on the basis of a statistical transcription process, is arranged to generate a phonetic transcription of words contained in the corpus that can be emitted in a form in which it is represented by statistical transcription information STI.

The recognition means 24 further have a phonetic transcription stage 121 that is arranged to receive each individual word in the corpus text information CTI containing the corpus and, by taking account of the context or topic information CI and the information on the language implicitly contained therein, to make available for and transmit to the lexicon generating stage 118 a phonetic transcription of each word of the corpus text information CTI in the form of corpus phonetic transcription information CPTI. For this purpose the phonetic transcription stage 121 is arranged to check whether a suitable phonetic transcription is available for the given word in the background lexicon storage stage 119. If one is, the information BTI forms the information CPTI. If a suitable transcription is not available, then the phonetic transcription stage 121 is arranged to make available the information STI representing the given word to form the information CPTI.
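
The look-up-then-fall-back behaviour of the stage 121 reduces, in a sketch, to the following; the naive letter-level transcription merely stands in for the statistical transcription process of the stage 120.

# Stage 121 as a sketch: the background lexicon (stage 119) is consulted
# first, and only words missing from it fall back to the statistical
# transcription (stage 120).
background_lexicon = {"report": "r ih p ao r t"}        # stage 119 (BTI)

def statistical_transcription(word):                    # stage 120 (STI)
    return " ".join(word)     # placeholder grapheme-level "phonemes"

def transcribe(word):                                   # stage 121 (CPTI)
    if word in background_lexicon:
        return background_lexicon[word]   # BTI forms the CPTI
    return statistical_transcription(word)

for w in ("report", "blog"):
    print(w, "->", transcribe(w))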

It should be mentioned at this point that the third phoneme model PM3 is also referred to as acoustic references, which means that the trainable resources comprise the acoustic references and the context or topic.

It should also be mentioned at this point that a so-called training lexicon is employed at each of the stages 69, 80, 97 and 116, by means of which a phonetic transcription required for the given training operation is generated from the training text or corpus information TTI.

In the speech recognition means 24, the items of information ASI, LI, SGI and CI that can be generated in a multi-stage fashion and each represent a language property produce essentially three effects. A first effect is that the filtering of the feature vectors FV is controlled by means of the segmentation information ASI at the third speech filter stage 104. This gives the advantage that the recognition of the text information TI can be performed accurately and swiftly, and autonomously and regardless of any prior way in which the feature vectors FV representing the speech information SI may have been affected, by background noise for example. A second effect is that, with the help of the channel information CHI, the language information LI and the speaker group information SGI, the selection of an acoustic reference corresponding to these items of information is controlled at the resources. This gives the advantage that a considerable contribution is made to the accurate recognition of the text information TI because the acoustic reference models the acoustic language property of the language with great accuracy. A third effect is that the selection of a context or topic is controlled at the resources with the help of the context or topic information. This gives the advantage that a further positive contribution is made to the accurate and swift recognition of the text information TI. With regard to accurate recognition, the advantage is obtained because a selectable topic models the actual topic that exists in the case of a language far more accurately than would be the case if there were a relatively wide topic that was rigidly preset. With regard to swift recognition, the advantage is obtained because the particular vocabulary corresponding to one of the items of context or topic information CI covers only some of the words in a language and can therefore be relatively small and hence able to be processed at a correspondingly high speed.

In the present case it has proved advantageous for the recognition stages 21, 22 and 24 each to have a speech filter stage 61, 76 and 104 of their own. Because of its function, the recognition stage 23 implicitly contains speech filtering facilities. It should be mentioned that in place of the three speech filter stages 61, 76 and 104 there may also be provided a single speech filter stage 122 as shown in FIG. 1 that is connected upstream of the recognition stages 21, 22, 23 and 24, which does not however have any adverse effect on the operation of recognition stage 23. This would give the advantage that the three speech filter stages 61, 76 and 104 would become unnecessary and, under certain circumstances, the processing of the feature vectors FV could therefore be performed more quickly as well.

It should be mentioned that, in place of the feature-vector extraction means 19 connected upstream of the means 20 to 24, each of the means 20 to 24 may have an individual feature-vector extraction means assigned to it, to which the preprocessed audio signal PAS can be fed. This makes it possible for each of the individual feature-vector extraction means to be optimally and individually adapted to the operation of its respective means 20 to 24. This gives the advantage that the vector representation of the preprocessed audio signal PAS can also take place in an individually adapted manner on a level other than the cepstral level.

It should be mentioned that the speech information SI may also be made available to the speech recognition device 1 by means of a storage medium or with the help of a computer network.

It should be mentioned that the stage 12 may also be implemented by hardware.

It should be mentioned that the conversion-stage implementing stage 16 may also be implemented as a hardware solution.

It should be mentioned that the sub-areas of the audio signal PAS and the items of information CHI, ASI, LI, SGI and CI corresponding thereto may also be stored in the form of so-called software objects and that the recognition means 18, 20, 21, 22, 23 and 24 may be arranged to generate, alter and process such software objects. Provision may also be made for it to be possible for the storage of the sub-areas of the audio signal PAS and the storage or management of the items of information CHI, ASI, LI, SGI and CI respectively associated with them to be carried out independently by the means 18, 20, 21, 22, 23, 24 and 25. It should also be mentioned that the means 8, 19 and the stage 122 may be implemented by a software object. The same is true of the recognition means 18, 20, 21, 22, 23, 24 and 25. It should also be mentioned that the means 8, 18, 19, 20, 21, 22, 23, 24 and 25 may be implemented in the form of hardware.

The means 24 forms, in the embodiment described above, a so-called “large vocabulary continuous speech recognizer”. It should however be mentioned that the means 24 may also form a so-called “command and control recognizer”, in which case the context or topic comprises only a lexicon and no language model. Additional provisions are also made that allow at least one grammar model to be managed.

For the purposes of the means 23 and 24, provision may also be made for the items of information CHI, LI and SGI to be combined into so-called phoneme model information, because the three items of information determine the particular phoneme model, even though the LI information is used independently of and in addition to the phoneme model information in the case of the means 23. This gives the advantage that the architecture of the speech recognition device 1 is simplified.

A further provision that may be made is for the means 20 additionally to be arranged to recognize so-called “hesitations”.

1. A system for providing transcription of a conference between a plurality of participants of the conference, the system comprising: a plurality of reception stages to receive information from the plurality of participants over a respective plurality of transmission channels; and at least one processor programmed to receive the information from the plurality of reception stages, the at least one processor further programmed to: analyze the information received at the plurality of reception stages to determine which of the plurality of participants of the conference is speaking during a given time interval based, at least in part, on identifying which of the plurality of reception stages is receiving speech information; select one of the plurality of transmission channels corresponding to the reception stage identified as receiving speech information as an in-use channel; determine channel information including at least one transmission parameter of the in-use channel; extract at least one feature vector from the speech information based, at least in part, on the channel information; perform acoustic segmentation of the speech information to generate acoustic segmentation information indicating at least one segment identified in the speech information based, at least in part, on the channel information and the at least one feature vector, the acoustic segmentation information including a label for the at least one segment of the speech information indicating whether the at least one segment is associated with speech, a pause in speech or non-speech; determine a language of the speech information based, at least in part, on the channel information, the at least one feature vector and the acoustic segmentation information; and generate text information corresponding to words recognized in the speech information based, at least in part, on the channel information, the at least one feature vector, the acoustic segmentation information and the language.
2. The system of claim 1, wherein the plurality of reception stages include at least two of the following: at least one sound card installed in at least one computer, the sound card connected to at least one microphone; at least one connection adapted to receive at least one analog telephone line; at least one connection adapted to receive at least one digital telephone line; at least one connection adapted to receive at least one Integrated Services Digital Network (ISDN) telephone line; at least one connection adapted to receive at least one data network channel; and at least one connection adapted to receive a voice-over-internet-protocol (VoIP) data stream.
3. The system of claim 2, wherein the channel information includes bandwidth information of the in-use channel.
4. The system of claim 2, wherein the at least one processor is programmed to recognize at least one key word in the speech information based, at least in part, on the language of the speech information, and wherein a speech recognizer provides the text information based, at least in part, on the at least one key word.

5. The system of claim 4, wherein the at least one processor is programmed to recognize a speaker group associated with the speech information based, at least in part, on the channel information and the language of the speech information, and wherein the speech recognizer provides the text information based, at least in part, on the speaker group.
6. A method of providing transcription of a conference between a plurality of participants of the conference, the method comprising: receiving information over a plurality of transmission channels from the plurality of participants; using at least one processor to analyze the information received at the plurality of reception stages to determine which of the plurality of participants of the conference is speaking during a given time interval based, at least in part, on identifying which of the plurality of reception stages is receiving speech information; selecting one of the plurality of transmission channels corresponding to the reception stage identified as receiving speech information as an in-use channel; determining channel information including at least one transmission parameter that identifies the in-use channel; extracting at least one feature vector from the speech information based, at least in part, on the channel information; performing acoustic segmentation of the speech information to generate acoustic segmentation information indicating at least one segment identified in the speech information based, at least in part, on the channel information and the at least one feature vector, the acoustic segmentation information including a label for the at least one segment of the speech information indicating whether the at least one segment is associated with speech, a pause in speech or non-speech; determining a language of the speech information based, at least in part, on the channel information, the at least one feature vector and the acoustic segmentation information; and generating text information corresponding to words recognized in the speech information based, at least in part, on the channel information, the at least one feature vector, the acoustic segmentation information and the language of the speech information.
7. The method of claim 6, wherein receiving speech information over a plurality of transmission channels includes receiving speech information via at least two of the following: at least one sound card installed in at least one computer, the sound card connected to at least one microphone; at least one analog telephone line; at least one digital telephone line; at least one Integrated Services Digital Network (ISDN) telephone line; at least one data network channel; and at least one voice-over-internet-protocol (VoIP) data stream.
8. The method of claim 7, wherein the channel information includes bandwidth information of the in-use channel.
9. The method of claim 6, further comprising recognizing at least one key word in the speech information based, at least in part, on the language of the speech information, and wherein providing the text information is based, at least in part, on the at least one key word.
10. The method of claim 9, further comprising recognizing a speaker group associated with the speech information based, at least in part, on the channel information and the language of the speech information, and wherein providing the text information is based, at least in part, on the speaker group.
11. A computer readable storage device encoded with a plurality of instructions for execution on at least one processor, the plurality of instructions, when executed on the at least one processor, performing a method of providing transcription of a conference between a plurality of participants of the conference, the method comprising: receiving information over a plurality of transmission channels from the plurality of participants; analyzing the information received at the plurality of reception stages to determine which of the plurality of participants of the conference is speaking during a given time interval based, at least in part, on identifying which of the plurality of reception stages is receiving speech information; selecting one of the plurality of transmission channels corresponding to the reception stage identified as receiving speech information as an in-use channel; determining channel information including at least one transmission parameter that identifies the in-use channel; extracting at least one feature vector from the speech information based, at least in part, on the channel information; performing acoustic segmentation of the speech information to generate acoustic segmentation information indicating at least one segment identified in the speech information based, at least in part, on the channel information and the at least one feature vector, the acoustic segmentation information including a label for the at least one segment of the speech information indicating whether the at least one segment is associated with speech, a pause in speech or non-speech; determining a language of the speech information based, at least in part, on the channel information, the at least one feature vector and the acoustic segmentation information; and generating text information corresponding to words recognized in the speech information based, at least in part, on the channel information, the at least one feature vector, the acoustic segmentation information and the language of the speech information.
12. The computer readable storage device of claim 11, wherein receiving speech information over a plurality of transmission channels includes receiving speech information via at least two of the following: at least one sound card installed in at least one computer, the sound card connected to at least one microphone; at least one analog telephone line; at least one digital telephone line; at least one Integrated Services Digital Network (ISDN) telephone line; at least one data network channel; and at least one voice-over-internet-protocol (VoIP) data stream.
13. The computer readable storage device of claim 12, wherein the channel information includes bandwidth information of the in-use channel.
14. The computer readable storage device of claim 11, further comprising recognizing at least one key word in the speech information based, at least in part, on the language of the speech information, and wherein providing the text information is based, at least in part, on the at least one key word.
15. The computer readable storage device of claim 14, further comprising recognizing a speaker group associated with the speech information based, at least in part, on the channel information and the language of the speech information, and wherein providing the text information is based, at least in part, on the speaker group.